-
Notifications
You must be signed in to change notification settings - Fork 115
Index nested fields #365
base: master
Are you sure you want to change the base?
Index nested fields #365
Conversation
d4cd2e8
to
45fe1ce
Compare
src/main/scala/com/microsoft/hyperspace/index/sources/FileBasedSourceProviderManager.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
BTW thanks for the great work! and sorry for the delay in reviewing.. 🐢🦥 I'll try to do the review asap. |
Could you add logical/spark plan change in the PR description?
|
@sezruby Thanks for reviewing this PR. I added some more code to support nested fields in joins and all hybrid scans. I'll create start integrating your feedback. |
This comment has been minimized.
This comment has been minimized.
Thanks for reporting this. Build nodes must have changed. @sezruby do you have time to take a look? Otherwise, I will get to this toward the end of this week. We can rely on Scala 2.12 pipeline meanwhile. |
2223313
to
59b09a1
Compare
@andrei-ionescu I know this change all comes together, but could you have split this PR into small PRs by any chance? You can keep this PR for reference and I suggest following 3 PRs:
BTW thanks for the plan update :) seems it works well as expected 👍 |
@sezruby In regards to review effort, I won't help too much because creation and refresh parts are very small and the other two are equally in code change. I did tackled the feature like this:
I don't understand why you prefer separate PRs instead of separate commits - you can review each commit separately. Do you want to merge them separately? If so, wouldn't Hyperspace be in inconsistent state feature wise? |
@andrei-ionescu I have the same experience when I was working on Hybrid Scan last year - #123 (one big PR, but rejected 🙅) and I split the PR into several PRs (#150).
Since these commits become 1 commit after merging to master branch.
It's fine as long as they don't break the build & test. Create & refresh can be small but it requires some utility functions & test setup. |
@sezruby The only advantage for separate PRs is if you specifically intend to merge them separately. For better reading and understanding of the code, we can still keep a single PR but with more commits. For me building separate PRs means a lot of extra work as I need to keep al those branches up to date through constant rebases and link them one on top of the other. There are multiple changes in the same file and any rebase will result in as many more times of conflict resolving as the branches I create for these PRs. I would like not to go through this ordeal. BTW: What does "CL" stand for? I've seen it in the developer docs but never explained clearly. |
+1 for smaller PRs if possible. For this PR, I think it would make sense to split as @sezruby suggested above since they seem easily splittable? The rationale behind the smaller PRs is described in the doc shared above. Especially, when we have a PR that's +2500 lines long, it's really hard for reviewers to review quickly and correctly. (and reviewing by commits doesn't seem possible when each commit is also big)
Since you have this "reference" PR, you can just create a PR one by one instead of creating all of them at once? In this way, you don't need to keep many branches up to date.
Thanks for working on this great feature. I think smaller PRs will help both sides to iterate quicker. But if you think it's not feasible, you can still keep this PR as one, but please bear with us if it takes a very long time to get it reviewed/merged. Thanks! (I believe CL stands for change list)? |
@imback82 Thanks for your opinion. If you can guarantee that it will take less time and a better team focus to get everything merged as small PRs than this big PR, then I can see the its advantage. Otherwise it's just extra work on my side. BTW, It is nowhere specified in any docs that in Hyperspace, one should be using Google's Small CLS engineering protocols. |
59b09a1
to
19d26f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did one round except for RuleUtils + tests
@@ -65,9 +65,15 @@ class CreateAction( | |||
} | |||
|
|||
private def isValidIndexSchema(config: IndexConfig, schema: StructType): Boolean = { | |||
// Flatten the schema to support nested fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Flatten the schema to support nested fields | |
// Flatten the schema to support nested fields. |
val dfColumnNames = df.schema.fieldNames | ||
val indexedColumns = indexConfig.indexedColumns | ||
val includedColumns = indexConfig.includedColumns | ||
val dfColumnNames = SchemaUtils.flatten(df.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some comment here? for flatten
operation
val indexedColumns = indexConfig.indexedColumns | ||
val includedColumns = indexConfig.includedColumns | ||
val dfColumnNames = SchemaUtils.flatten(df.schema) | ||
val indexedColumns = SchemaUtils.unescapeFieldNames(indexConfig.indexedColumns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some comment why "unescapeFieldNames" required here? e.g. nested column names are stored as escaped in index log entry.
@@ -109,4 +109,7 @@ object IndexConstants { | |||
// To provide multiple paths in the globbing pattern, separate them with commas, e.g. | |||
// "/temp/1/*, /temp/2/*" | |||
val GLOBBING_PATTERN_KEY = "spark.hyperspace.source.globbingPattern" | |||
|
|||
// Indicate whether the index has been built over a nested field. | |||
private[hyperspace] val USES_NESTED_FIELDS_PROPERTY = "hasNestedFields" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this up around 104 lines?
assert(parts.forall(_.isInstanceOf[HashPartitioning])) | ||
assert(parts.forall(_.numPartitions == bucketSpec.numBuckets)) | ||
|
||
val reduced = parts.reduceLeft { (a, b) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some comments about reduced
with some example?
|
||
/** | ||
* Returns true if the given project is a supported project. If all of the registered | ||
* providers return None, this returns false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you revise the comment?
|
||
/** | ||
* Returns true if the given filter is a supported filter. If all of the registered | ||
* providers return None, this returns false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
/** | ||
* Given a nested field this method extracts the full name out of it. | ||
* | ||
* @param field The field from which to get the name from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit & ditto below functions
* @param field The field from which to get the name from | |
* @param field The field from which to get the name from. |
object SchemaUtils { | ||
|
||
val NESTED_FIELD_NEEDLE_REGEX = "\\." | ||
val NESTED_FIELD_REPLACEMENT = "__" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it safe to use double underscores?
* @param index The chosen index | ||
* @return A collection of nested field names | ||
*/ | ||
private def getNestedFields(index: IndexLogEntry): Seq[String] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to move these two func to IndexLogEntry. WDYT?
Yes, I think it will definitely accelerate the review process. @sezruby, do you also agree after doing reviews on this PR?
We have been referring to the doc internally if there is a conflict in the discussion. We can make it a formal process if needed; note that we are still improving the process in this repo as needed/required. |
Yea as most of functionality is validated in this PR, we can do small PRs only for merging them partially. It's too long scrooooll and got distracted in the middle of reviewing :D .. @andrei-ionescu with this PR, you won't be able to get the review from @imback82 this year lol |
I'm just saying that I would prefer to know it before hand and it would have been very useful to me to have it in the dev process docs. I would have started with it in mind and would ease up may work load. Keeping in mind that bigger features are going to come and that you desire the Hyperspace community to grow, It would be very helpful to have it stated in your docs for further big chunks of code changes. @imback82, @sezruby: To set this clear so I wouldn't get through this process of "re-creating PRs again and again", look at the the following set of PRs that I want to create for this feature:
Q1: Can we agree on the PRs lists above? |
@andrei-ionescu |
Sure, I will update the contribution guide.
The list looks reasonable to me.
Sorry about reviewing the PRs late. We were not expecting a big feature from the external contributors, so I acknowledge that we didn't allocate resources correctly. We will try to fix this moving forward. And to answer your concern, yes, once the PRs are in a manageable size, we will review them as soon as we can. I expect you create these PRs one by one, and not all at once? |
@imback82 I will create them as my time allows. If I'll be able to create all of them at once then I'll do it like that. What I can say is that I'll start creating them in the agreed order. I'll ping you guys on each of them as they land. |
What is the context for this pull request?
What changes were proposed in this pull request?
This PR adds support for indexing over nested fields (ie: structs).
The first commit is adding support for building the index over a nested (struct) field and support for modifying the search query to properly use that index. It has suport for hybrid scans for both append and delete files in the hybrid scan context.
The second commit will address the join use case.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit and integration tests for not breaking the existing functionalities.
Unit and integration test added for the new functionalities.
Search queries
The following search queries use a dataset with the following schema:
Files Appended
The optimized plan
The Spark plan
Files Deleted
The optimized plan
The Spark plan
Join queries
The following join queries will have a dataset a bit different from the one at the beginning. The following are extracted from the
HybridScanForNestedFieldsTest
tests.Join append only
Original plan
Altered optimized plan
Altered Spark plan
Delete files
Original plan
Altered optimized plan
Altered Spark plan
Append + Delete
Original plan
Altered optimized plan
Altere Spark plan